Abstract

In comparing clustering partitions, the Rand index (RI) and the Adjusted Rand
index (ARI) are commonly used for measuring the agreement between
the partitions. Both these external validation indexes aim to analyze how
close a clustering is to a reference (or to prior knowledge about the data)
by counting correctly classified pairs of elements. When the aim is to
evaluate the solution of a fuzzy clustering algorithm, the computation of
these measures requires converting the soft partitions into hard ones. It is
known that different fuzzy partitions describing very different structures
in the data can lead to the same crisp partition and consequently to the
same values of these measures.

We compare the existing approaches to evaluate the external validation 
criteria in fuzzy clustering and we propose an extension of the ARI for 
fuzzy partitions based on the normalized degree of concordance. Through use 
of real and simulated data, we analyze and evaluate the performance of 
our proposal. 

Keywords: Clustering, cluster validity, fuzzy partitions, external evaluation
measures, Rand index, Adjusted Rand index


1 Introduction 

Cluster analysis, broadly speaking, can be defined as an unsupervised method
for grouping objects into clusters such that objects in the same cluster are as
similar as possible and objects in different clusters are as dissimilar as possible.
An important distinction can be made between hard and soft clustering
algorithms. Hard clustering methods consider disjoint partitions. In other words,
an object either belongs or does not belong to a cluster. More formally, given a data
set $X$, the clustering structure can be presented as a set of non-empty subsets
$\{C_1, \dots, C_k, \dots, C_K\}$ such that:


$$C_k \cap C_{k'} = \emptyset, \quad k \neq k'.$$

When the data present no sharp boundaries between clusters, fuzzy clustering
algorithms should be preferred. These methods determine for each object
a degree of membership in every cluster (Ruspini, 1970; Bezdek et al., 1984;
Höppner et al., 1999). In this way, objects that are on the boundary between
different clusters are not forced to belong to one specific cluster, but
they present a different degree of membership for each cluster. More formally,
these methods partition the elements in $X$ into $K$ fuzzy overlapping clusters, with
respect to some defined criterion, and they return both a set of cluster centers
and a partition matrix of the following form


$$W = \{w_{i,k}\}_{(n \times K)}, \quad w_{i,k} \in [0,1]; \quad \sum_{k=1}^{K} w_{i,k} = 1 \quad \forall i \in \{1, \dots, n\}, \qquad (1.2)$$

in which $w_{i,k}$ represents the degree to which the element $x_i$ belongs to the
cluster $C_k$.
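As an illustration of the constraints in Equation (1.2), the following minimal sketch checks whether a matrix is a valid fuzzy partition matrix and converts it into a crisp labelling by maximum membership. The function names `is_fuzzy_partition` and `harden` and the two example matrices are hypothetical and are not part of the original study.

```python
import numpy as np


def is_fuzzy_partition(W, tol=1e-8):
    """Check Eq. (1.2): all entries in [0, 1] and every row sums to 1."""
    W = np.asarray(W, dtype=float)
    in_range = np.all((W >= -tol) & (W <= 1.0 + tol))
    rows_sum_to_one = np.allclose(W.sum(axis=1), 1.0, atol=tol)
    return bool(in_range and rows_sum_to_one)


def harden(W):
    """Crisp labels by maximum membership (ties go to the first cluster)."""
    return np.argmax(np.asarray(W, dtype=float), axis=1)


# Two different (made-up) fuzzy partitions of three objects in two clusters.
W1 = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
W2 = np.array([[0.6, 0.4], [0.9, 0.1], [0.4, 0.6]])

print(is_fuzzy_partition(W1), is_fuzzy_partition(W2))
print(harden(W1), harden(W2))
```

Both matrices harden to the same crisp labels although their membership degrees differ, which is the loss of information that motivates fuzzy extensions of the external indexes.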

Cluster validation involves both internal and external validation criteria. The
internal validation criteria are based only on the observations in the clustered
data, while the external ones are defined on some information that is not used
in producing the clustering, i.e. a gold standard cluster structure known
a priori. Several external validation criteria have been proposed in the
literature. These scalar indexes assess the goodness of the partition obtained by
the clustering procedure on the basis of previous knowledge about the data.
Among these measures we can mention the Rand index (Rand, 1971), the
Fowlkes and Mallows $\mathcal{F}$-measure (Fowlkes and Mallows, 1983), the Jaccard
index (Downton and Brennan, 1980), the Mirkin metric (Mirkin, 1998) and the
Dice coefficient (Dice, 1945).


The remainder of the paper is organized as follows: in Section 2 the Rand
index and the adjusted version by Hubert and Arabie (1985) are presented. In
Section 3 some approaches to extend the Rand index and the Adjusted Rand
index to fuzzy partitions are presented. Section 4 is devoted to introducing and
explaining the Adjusted Concordance index. Section 5 is dedicated to the analysis
of the performance of the proposed index. Concluding remarks close the paper.


2 Rand index, Adjusted Rand index and related measures


The Rand index is an external evaluation measure developed by Rand (1971) to
compare the clustering partitions on a set of data.

Let $X$ be the $n \times p$ data matrix, where $n$ is the number of objects and $p$
the number of variables. A partition of the $n$ objects in $K$ subsets or groups,
$P = \{P_1, \dots, P_K\}$, can be formed in such a way that the union of all the subsets
is equal to the entire data set and the intersection of any two subsets is the
empty set.

Two elements of $X$, i.e. $(x, x')$, are said to be paired in $P$ if they
belong to the same cluster. Let $P$ and $Q$ be two partitions of the object set $X$.
The Rand index is calculated as:


$$RI = \frac{a + d}{a + b + c + d} = \frac{a + d}{\binom{n}{2}}, \qquad (2.1)$$


where 


• $a$ is the number of pairs $(x, x') \in X$ that are paired in $P$ and in $Q$;

• $b$ is the number of pairs $(x, x') \in X$ that are paired in $P$ but not paired in $Q$;

• $c$ is the number of pairs $(x, x') \in X$ that are not paired in $P$ but paired in $Q$;

• $d$ is the number of pairs $(x, x') \in X$ that are neither paired in $P$ nor in $Q$.

This index varies in [0,1] with 0 indicating that the two partitions do not agree 
on any pair of elements and 1 indicating that the two partitions are exactly 
the same. Unfortunately the Rand statistic approaches its upper limit as the 
number of clusters increases. 
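The pair-counting definition of the Rand index can be sketched as follows. This is a minimal illustration for crisp label vectors; the names `pair_counts` and `rand_index` and the two example labellings are hypothetical and are not taken from the original study.

```python
from itertools import combinations


def pair_counts(labels_p, labels_q):
    """Return (a, b, c, d) over all unordered pairs of objects."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_p)), 2):
        same_p = labels_p[i] == labels_p[j]
        same_q = labels_q[i] == labels_q[j]
        if same_p and same_q:
            a += 1  # paired in P and in Q
        elif same_p:
            b += 1  # paired in P only
        elif same_q:
            c += 1  # paired in Q only
        else:
            d += 1  # paired in neither partition
    return a, b, c, d


def rand_index(labels_p, labels_q):
    """Rand index (a + d) / (a + b + c + d)."""
    a, b, c, d = pair_counts(labels_p, labels_q)
    return (a + d) / (a + b + c + d)


labels_p = [0, 0, 0, 1]  # made-up crisp partition P of four objects
labels_q = [0, 0, 1, 1]  # made-up crisp partition Q of the same objects
print(pair_counts(labels_p, labels_q), rand_index(labels_p, labels_q))
```

The double loop over pairs costs O(n^2); for large n a contingency-table formulation is the usual alternative.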


There are some other known problems with the Rand index (Meilă, 2007):

• The expected value of the Rand index for two random partitions does not
take a constant value;

• It presents high variability and, as proved by Fowlkes and Mallows (1983),
it concentrates in a small interval close to 1;

• It is extremely sensitive to the number of groups considered in each
partition (as proved by Morey and Agresti (1984)), their size and also to the
overall number of observations considered.


To overcome these problems Hubert and Arabie (1985) proposed a corrected
version of the Rand index assuming the generalized hypergeometric
distribution as model of randomness (i.e. $P$ and $Q$ are picked at random with a fixed
number of clusters and a fixed number of elements in each). In other words,
this corrected version is equal to the normalized difference of the Rand index
and its expected value under the null hypothesis:

$$ARI = \frac{\text{Index} - \text{Expected Index}}{\text{Maximum Index} - \text{Expected Index}}. \qquad (2.2)$$


For further details we refer to Hubert and Arabie (1985). More formally, the
Hubert-Arabie formulation of the adjusted Rand index is:

$$ARI_{HA} = \frac{2(ad - bc)}{b^2 + c^2 + 2ad + (a + d)(c + b)}. \qquad (2.3)$$


This index has an upper bound of 1 and takes the value 0 when the Rand
index is equal to its expected value (under the generalized hypergeometric
distribution assumption for randomness). Negative values are possible but not
interesting since they indicate less agreement than expected by chance.
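Equation (2.3) can be evaluated directly from the four cardinalities, as in the minimal sketch below. The function name `adjusted_rand_from_counts` is hypothetical, and returning 1 when the denominator vanishes (for example when both partitions consist of a single cluster) is an assumed convention, not something stated in the text.

```python
def adjusted_rand_from_counts(a, b, c, d):
    """Hubert-Arabie adjusted Rand index written in terms of a, b, c, d."""
    denominator = b ** 2 + c ** 2 + 2 * a * d + (a + d) * (c + b)
    if denominator == 0:
        # Assumed convention for degenerate cases (e.g. one cluster in both).
        return 1.0
    return 2 * (a * d - b * c) / denominator


print(adjusted_rand_from_counts(1, 2, 1, 2))  # agreement at chance level
print(adjusted_rand_from_counts(2, 0, 0, 4))  # identical partitions
print(adjusted_rand_from_counts(0, 2, 2, 2))  # less agreement than chance
```

The three calls use made-up counts for four objects (six pairs) and cover the chance-level, perfect-agreement and negative cases discussed above.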

As regards the other related comparison measures, all of them can also be
expressed in terms of the four cardinalities $a$, $b$, $c$ and $d$. The Jaccard index, also
known as Tanimoto coefficient, is equal to $J = \frac{a}{a+b+c}$, the Fowlkes-Mallows
$\mathcal{F}$-index is equal to $\mathcal{F} = \frac{a}{\sqrt{(a+b)(a+c)}}$, the Mirkin metric can be written as
$M = 2(b + c)$ and the Dice coefficient is equal to $D = \frac{2a}{2a+b+c}$. For a more extensive
description of these related indexes we refer to Wagner and Wagner (2007).
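For reference, the related measures can be computed from the same four cardinalities. The sketch below uses the standard pair-counting forms of the Jaccard, Fowlkes-Mallows, Mirkin and Dice measures; the function names are hypothetical and no guard against zero denominators is included.

```python
from math import sqrt


def jaccard(a, b, c, d):
    """Jaccard (Tanimoto) index."""
    return a / (a + b + c)


def fowlkes_mallows(a, b, c, d):
    """Fowlkes-Mallows index."""
    return a / sqrt((a + b) * (a + c))


def mirkin(a, b, c, d):
    """Mirkin metric (a distance: 0 means identical partitions)."""
    return 2 * (b + c)


def dice(a, b, c, d):
    """Dice coefficient."""
    return 2 * a / (2 * a + b + c)


counts = (1, 2, 1, 2)  # made-up cardinalities for four objects
for measure in (jaccard, fowlkes_mallows, mirkin, dice):
    print(measure.__name__, measure(*counts))
```

Note that `d` is accepted by every function only to keep a uniform signature; of the four measures, none uses it.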


3 Extensions of the Rand index to fuzzy partitions


The problem of evaluating the solution of a fuzzy clustering algorithm with
the Rand index is that it requires converting the soft partition into a hard
one, thereby losing information. As shown in Campello (2007), different
fuzzy partitions describing different structures in the data may lead to the
same crisp partition and hence to the same Rand index value. Because of this loss of
information, the Rand index is not able to discriminate between overlapping and
non-overlapping clusters. Therefore it is not appropriate for fuzzy clustering
assessment.


Campello (2007) proposed a fuzzy extension of the Rand index and related
indexes by defining a set-theoretic form to calculate the four cardinalities. His
main goal was to compare a fuzzy partition with a non-fuzzy one but, as he
notes himself, his measure can also be used to compare two fuzzy partitions.
Unfortunately this measure fails to satisfy reflexivity (i.e. the extension of the
Rand index calculated between two identical partitions is less than 1), and thus
it cannot be considered a proper metric. Frigui et al. (2007) proposed a similar
measure, which can be considered a special case of Campello's one, that
also cannot be considered a proper metric since it also fails to satisfy
reflexivity. Brouwer (2009) proposed an alternative extension of the Rand index
and related measures, based on the cosine correlation as measure of bonding
(or similarity) between two items with fuzzy membership vectors. This
measure unfortunately also violates the reflexivity condition.

Anderson et al. (2010) proposed a fuzzy generalization of the Rand index
and other measures between soft partitions (i.e. fuzzy and possibilistic
partitions) based on matrix operations that presents a clear advantage in terms of
efficiency, since it does not consider all pairs of objects involved in the
calculation of the four cardinalities necessary to calculate these indexes.
Unfortunately, also in this case, as the authors stated, their generalization is a similarity
measure that cannot be interpreted as a metric. For a more extensive
discussion of these aforementioned proposals, we refer to Hüllermeier et al. (2012).

Hüllermeier et al. (2012) proposed a generalization of the Rand index and of
the related measures, namely the Jaccard measure and the Dice coefficient.

Let $P = \{P_1, \dots, P_K\}$ be a fuzzy partition of the data matrix $X$. Each item $x \in X$
is then characterized by its membership vector:

$$P(x) = (P_1(x), P_2(x), \dots, P_K(x)) \in [0,1]^K, \qquad (3.1)$$

where $P_i(x)$ is the membership degree of $x$ in the $i$-th cluster $P_i$. Given any
pair $(x, x') \in X$, they defined a fuzzy equivalence relation on $X$ in terms of
similarity measure as:

$$E_P(x, x') = 1 - \|P(x) - P(x')\|, \qquad (3.2)$$

where $\| \cdot \|$ is the normalized $L_1$-norm, which constitutes a proper metric on
$[0,1]^K$ and yields values in $[0,1]$. $E_P$ is equal to 1 if and only if $x$ and $x'$ have the
same membership pattern and is smaller than 1 otherwise (for crisp partitions
it is then equal to 0).

Given two fuzzy partitions, $P$ and $Q$, the basic idea underlying the fuzzy extension
of the Rand index is to generalize the concept of concordance in the following
way. Considering a pair $(x, x')$ as being concordant insofar as $P$ and $Q$ agree on its
degree of equivalence, Hüllermeier et al. (2012) define the degree of concordance
as:

$$conc(x, x') = 1 - |E_P(x, x') - E_Q(x, x')| \in [0,1], \qquad (3.3)$$

and the degree of discordance as:

$$disc(x, x') = |E_P(x, x') - E_Q(x, x')|. \qquad (3.4)$$

The distance measure is then defined by the normalized sum of the degrees of
discordance over all pairs:

$$d(P, Q) = \frac{\sum_{(x, x') \in X} |E_P(x, x') - E_Q(x, x')|}{n(n - 1)/2}. \qquad (3.5)$$

The direct generalization of the Rand index corresponds to the normalized
degree of concordance (NDC) and it is equal to:

$$R_E(P, Q) = 1 - d(P, Q), \qquad (3.6)$$

and it reduces to the original Rand index when partitions $P$ and $Q$ are
non-fuzzy.
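Equations (3.2) to (3.6) can be sketched as follows. The functions `equivalence_matrix` and `ndc` are hypothetical names, and the factor 1/2 used to normalize the $L_1$ distance is an assumption that holds when every membership vector sums to one.

```python
import numpy as np


def equivalence_matrix(P):
    """E[i, j] = 1 - 0.5 * sum_k |P[i, k] - P[j, k]| (assumed normalization)."""
    P = np.asarray(P, dtype=float)
    l1 = np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)
    return 1.0 - 0.5 * l1


def ndc(P, Q):
    """Normalized degree of concordance: 1 - mean |E_P - E_Q| over pairs."""
    E_p, E_q = equivalence_matrix(P), equivalence_matrix(Q)
    upper = np.triu_indices(E_p.shape[0], k=1)  # each unordered pair once
    return 1.0 - float(np.mean(np.abs(E_p[upper] - E_q[upper])))


# Made-up crisp partitions of four objects, written as one-hot matrices.
P = np.eye(2)[[0, 0, 0, 1]]
Q = np.eye(2)[[0, 0, 1, 1]]
print(ndc(P, Q))  # coincides with the Rand index of the two crisp labellings
print(ndc(P, P))  # reflexivity: identical partitions give 1
```

Because only the n x n equivalence matrices are compared, the two partitions may have different numbers of clusters.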

This distance is a pseudo-metric, since it always satisfies the conditions of
non-negativity, reflexivity, symmetry and triangle inequality, and it is a metric
under particular assumptions (which can be summarized as considering
Ruspini's partitions, the existence of a prototypical element for each cluster
and the equivalence relation on $X$: $E_P(x, x') = 1 - \|P(x) - P(x')\|$), because it then also
satisfies the separation condition.

Since we are interested in comparing fuzzy partitions and the adjusted Rand
index proposed by Hubert and Arabie is still the most popular measure used
for comparing clusterings, we propose an extension of this index to fuzzy
partitions, namely the Adjusted Concordance Index, based on the fuzzy variant
of the Rand index proposed by Hüllermeier et al. (2012). These authors indeed
proposed the extension of a large number of related comparison measures,
which can be expressed in terms of the cardinals $a$, $b$, $c$ and $d$, through the
formalization of these cardinals in fuzzy logic concordance terms.

These cardinals can be expressed as follows:

• $a$-concordance: objects $x$ and $x'$ are concordant because their degree of
equivalence in $P$ and in $Q$ is similar and their degree of equivalence in $P$
is high and their degree of equivalence in $Q$ is high

$a = \top(1 - |E_P(x, x') - E_Q(x, x')|, \top(E_P(x, x'), E_Q(x, x')))$;

• $d$-concordance: negation of $a$-concordance (objects $x$ and $x'$ are
concordant but either the degree of equivalence in $P$ is not high or the degree
of equivalence in $Q$ is not high)

$d = \top(1 - |E_P(x, x') - E_Q(x, x')|, \bot(1 - E_P(x, x'), 1 - E_Q(x, x')))$;

• $b$-discordance: the degree of equivalence of $x$ and $x'$ in $P$ is larger than
that in $Q$

$b = \max(E_P(x, x') - E_Q(x, x'), 0)$;

• $c$-discordance: the degree of equivalence of $x$ and $x'$ in $P$ is smaller than
that in $Q$

$c = \max(E_Q(x, x') - E_P(x, x'), 0)$;

where $\top$ is the triangular product norm and $\bot$ is the associated triangular
conorm (algebraic sum) (Klement et al., 2013). The cardinals just mentioned
can also be expressed as:

$$a = (1 - |E_P(x, x') - E_Q(x, x')|) \cdot E_P(x, x') \cdot E_Q(x, x')$$
$$d = (1 - |E_P(x, x') - E_Q(x, x')|) \cdot (1 - E_P(x, x') \cdot E_Q(x, x'))$$
$$b = \max(E_P(x, x') - E_Q(x, x'), 0)$$
$$c = \max(E_Q(x, x') - E_P(x, x'), 0) \qquad (3.7a)$$
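The product-norm form of the four fuzzy cardinals can be sketched element-wise as below. The function name `fuzzy_cardinalities` and the example degrees of equivalence are hypothetical; the sketch also illustrates that, with the product norm and the algebraic sum, the four degrees of each pair add up to one.

```python
import numpy as np


def fuzzy_cardinalities(e_p, e_q):
    """Per-pair degrees a, b, c, d from degrees of equivalence in P and Q."""
    e_p = np.asarray(e_p, dtype=float)
    e_q = np.asarray(e_q, dtype=float)
    conc = 1.0 - np.abs(e_p - e_q)      # degree of concordance of each pair
    a = conc * e_p * e_q                # product t-norm
    d = conc * (1.0 - e_p * e_q)        # algebraic sum of (1 - e_p, 1 - e_q)
    b = np.maximum(e_p - e_q, 0.0)
    c = np.maximum(e_q - e_p, 0.0)
    return a, b, c, d


# Made-up degrees of equivalence for four pairs (the last two are crisp).
e_p = np.array([0.50, 0.88, 1.0, 0.0])
e_q = np.array([0.11, 0.59, 1.0, 1.0])
a, b, c, d = fuzzy_cardinalities(e_p, e_q)
print(a, b, c, d)
print(a + b + c + d)
```

Summing each returned array over all pairs gives the fuzzy counterparts of the crisp counts; for 0/1 degrees of equivalence they coincide with the usual pair counts.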


4 The adjusted concordance index 


Hüllermeier et al. (2012) did not explicitly propose an extension of the adjusted
Rand index. Our idea of an extension of the normalized degree of concordance
(NDC) was born when we noticed that the other proposed extensions
of the Rand index to fuzzy partitions were based upon the generalization of
the four cardinalities presented in a standard contingency table to compare two
partitions. Hüllermeier, Rifqi, Henzgen, and Senge's proposal of a fuzzy version
of the Rand index is instead based on the fuzzy equivalence relation, and
this allows us to rewrite every partition as a similarity matrix based on the
normalized city block distance. For example, if we consider the following two
crisp partitions

we obtain that $E_P$ and $E_Q$ are equal to

This enables us to calculate the four cardinalities, $a$, $b$, $c$ and $d$, by considering
pairs of objects that are paired in both partitions, pairs of objects that are not
paired in either partition, and pairs of objects that are paired in one partition
but not in the other and vice versa, obtaining, in this simple example: $a = 1$,
$b = 2$, $c = 1$ and $d = 2$. Using the standard formulation of the contingency
table to compare two partitions, it is possible to obtain exactly the same values
for the four cardinalities (Hubert and Arabie, 1985; Meilă, 2007). Then we can
see that the Rand index and the adjusted Rand index for this toy example are
respectively equal to RI = 0.5 and ARI = 0.

Following the same line of reasoning, when we consider fuzzy partitions the
elements in the matrices $E_P$ and $E_Q$ are, of course, real numbers between 0 and
1 and they represent similarities between pairs of objects within the respective
partition. Consequently, the normalized degree of concordance can be
seen as the similarity measure between two similarity matrices. For instance,
considering two random fuzzy partitions of a set of $n = 4$ objects


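$$P' = \begin{bmatrix} 0.29 & 0.71 \\ 0.79 & 0.21 \\ 0.41 & 0.59 \\ 0.88 & 0.12 \end{bmatrix} \quad \text{and} \quad Q' = \begin{bmatrix} 0.94 & 0.06 \\ 0.05 & 0.95 \\ 0.53 & 0.47 \\ 0.89 & 0.11 \end{bmatrix}$$

we obtain that $E_{P'}$ and $E_{Q'}$ correspond to

$$E_{P'} = \begin{bmatrix} 1.00 & 0.50 & 0.88 & 0.41 \\ 0.50 & 1.00 & 0.62 & 0.91 \\ 0.88 & 0.62 & 1.00 & 0.53 \\ 0.41 & 0.91 & 0.53 & 1.00 \end{bmatrix} \quad \text{and} \quad E_{Q'} = \begin{bmatrix} 1.00 & 0.11 & 0.59 & 0.95 \\ 0.11 & 1.00 & 0.52 & 0.16 \\ 0.59 & 0.52 & 1.00 & 0.64 \\ 0.95 & 0.16 & 0.64 & 1.00 \end{bmatrix}$$

resulting in NDC = 0.6367.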

It can be noted that $E_P$ and $E_Q$ are symmetric matrices with a number of unique
similarities equal to $m = n(n-1)/2$.

Unfortunately it is not possible to adjust the NDC by considering an adjustment
of the four cardinalities. The formulation of the Adjusted Rand index by
Hubert and Arabie (1985) in terms of the four cardinalities, reported in
Equation (2.3), was derived as a simplification of

$$ARI_{HA} = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \sum_i \binom{n_{i+}}{2} \sum_j \binom{n_{+j}}{2} / \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{n_{i+}}{2} + \sum_j \binom{n_{+j}}{2}\right] - \sum_i \binom{n_{i+}}{2} \sum_j \binom{n_{+j}}{2} / \binom{n}{2}}, \qquad (4.1)$$

where $n$ is the number of objects, $n_{ij}$ are the cell counts and $n_{i+}$ and $n_{+j}$ are
respectively the row and column marginals of the contingency table obtained by
crossing two crisp partition vectors. When dealing with fuzzy partitions it is not
straightforward to obtain contingency tables. Anderson et al. (2010), following the same
approach as Hubert and Arabie (1985), obtained a fuzzy generalization of the
contingency tables, but the drawback of this generalization is that neither the
marginals nor the elements of the tables are integers, and then some of the
cardinalities can be negative. As a consequence, the use of the
binomial coefficients is not straightforward.

Therefore, the key idea is to use the NDC and normalize the difference between
the NDC and its expected value. We estimate the expected value of
the NDC by considering the average value of the index after permuting the
elements of each upper triangular similarity matrix, given the partitions and
a certain number of groups. When the number of pairwise similarities ($m$)
is small, the estimate of the expected value is based on considering all possible
permutations ($m!$), while, when the number of pairwise similarities is
large, we estimate the expected value by taking into account $h$ randomly selected
permutations out of the total $m!$ permutations.

For the toy example, we have that the expected value of the NDC, considering
all possible permutations of the upper triangular similarity matrices
($m! = 720$), corresponds to 0.6972. Hence, the ACI is equal to

$$ACI = \frac{NDC - \text{expected } NDC}{1 - \text{expected } NDC} = \frac{0.6367 - 0.6972}{1 - 0.6972} = -0.200. \qquad (4.2)$$
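The permutation-based adjustment can be sketched for the fuzzy toy example as follows. This is a minimal illustration rather than the authors' implementation: the function names, the threshold `max_exact` and the default number of random permutations are assumptions, and only one of the two similarity vectors is permuted, on the assumption that this yields the same average as permuting both. The $L_1$ distance is normalized by 1/2, which assumes membership rows summing to one.

```python
from itertools import permutations

import numpy as np


def upper_similarities(P):
    """Upper-triangular entries of E_P = 1 - 0.5 * L1 distance between rows."""
    P = np.asarray(P, dtype=float)
    E = 1.0 - 0.5 * np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)
    return E[np.triu_indices(P.shape[0], k=1)]


def aci(P, Q, max_exact=8, n_random=10000, seed=0):
    """Return (NDC, estimated expected NDC, ACI)."""
    u, v = upper_similarities(P), upper_similarities(Q)
    ndc = 1.0 - np.mean(np.abs(u - v))
    m = len(u)
    if m <= max_exact:
        # Small m: enumerate all m! permutations of one similarity vector.
        values = [1.0 - np.mean(np.abs(u[list(p)] - v))
                  for p in permutations(range(m))]
    else:
        # Large m: average over randomly selected permutations instead.
        rng = np.random.default_rng(seed)
        values = [1.0 - np.mean(np.abs(rng.permutation(u) - v))
                  for _ in range(n_random)]
    expected = float(np.mean(values))
    # Undefined if expected == 1 (all similarities identical in both vectors).
    return float(ndc), expected, float((ndc - expected) / (1.0 - expected))


P_prime = [[0.29, 0.71], [0.79, 0.21], [0.41, 0.59], [0.88, 0.12]]
Q_prime = [[0.94, 0.06], [0.05, 0.95], [0.53, 0.47], [0.89, 0.11]]
ndc_value, expected_ndc, aci_value = aci(P_prime, Q_prime)
print(round(ndc_value, 4), round(expected_ndc, 4), round(aci_value, 3))
```

With $n = 4$ objects there are $m = 6$ pairwise similarities, so the exact branch enumerates 720 permutations; for realistic sample sizes only the random branch is feasible.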


Similarly to Hubert and Arabie's ARI, negative values of the ACI are possible
but not interesting, since they indicate less agreement than expected by chance,
and the index can then be set equal to zero.

It is worth stressing that we can correct the NDC according to the proposed
approach because it is the only extension of the Rand index to fuzzy partitions
that fulfills the reflexivity property, which guarantees that its maximum
value is always equal to one.

It is worth mentioning that we ran an experiment to evaluate the bias in our
estimate of the expected value of the NDC. In this experiment we generated
1000 data sets with a random sample size $n$ between 100 and 1200, a random
number of clusters $C$ between 2 and 10, a random number of
dimensions between 2 and 10, and a covariance matrix $\Sigma = I\sigma$ with $\sigma$ between
0.1 and 3. We stored for each data set the composition of the clusters and we
performed a cluster analysis using the K-means algorithm. For each solution
we calculated Hubert and Arabie's adjusted Rand index and the ACI. The
average difference between these two measures resulted to be —4.5602 x 10“*^^. 


5 Experimental evaluation through simulated and 
real data set analyses 

5.1 Comparing fuzzy and crisp partitions 

For the first simulation study, we generated data with $C = 2, 3, 4$ cluster centers
by incrementally merging four different bivariate normal distributions with
mean vectors

$\mu_1 = [-2, -2]$;
$\mu_2 = [2, 2]$;
$\mu_3 = [0, 0]$;
$\mu_4 = [-2, 2]$,

and three different levels of variability described by the following covariance
matrices: $\Sigma_1 = I \times 0.01$, $\Sigma_2 = I \times 0.25$, $\Sigma_3 = I \times 1$. The structure of the
simulated data sets is presented in Table 1. The sample size was set equal to
100.
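A data-generating step of this kind can be sketched with NumPy as below. The text does not say whether the sample size of 100 refers to each component or to the whole data set; the sketch assumes 100 observations per component, and the function name `simulate`, the seed and the return format are hypothetical.

```python
import numpy as np

MEANS = np.array([[-2.0, -2.0], [2.0, 2.0], [0.0, 0.0], [-2.0, 2.0]])
VARIANCES = (0.01, 0.25, 1.0)  # diagonal values of the three covariances


def simulate(n_centers, variance, n_per_cluster=100, seed=0):
    """Draw bivariate normal clusters around the first n_centers means."""
    rng = np.random.default_rng(seed)
    cov = variance * np.eye(2)
    X = np.vstack([rng.multivariate_normal(mu, cov, size=n_per_cluster)
                   for mu in MEANS[:n_centers]])
    labels = np.repeat(np.arange(n_centers), n_per_cluster)
    U = np.eye(n_centers)[labels]  # crisp membership matrix (one-hot rows)
    return X, labels, U


X, labels, U = simulate(n_centers=3, variance=VARIANCES[1])
print(X.shape, U.shape)
```

The crisp membership matrix `U` plays the role of the reference partition against which a fuzzy solution would be compared.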

We then generated three data sets with $C = 2, 3, 4$ by sampling in turn from
the same bivariate normal distribution with mean vector equal to $[0, 0]$ and
covariance equal to $I \times 0.8$. For each data set, we stored the crisp membership
matrix and we ran the fuzzy C-means algorithm for each of them by setting
$C = 2, 3, 4$. We computed the normalized degree of concordance (NDC) and the
extensions of the Rand index proposed by Brouwer (2009), by Campello (2007)
and by Anderson et al. (2010).

Table 1: First simulation study: data sets structure

Data set number    Centers
1                  $\mu_1$, $\mu_2$
2                  Add $\mu_3$ to data set 1
3                  Add $\mu_4$ to data set 2


Table 2: Extensions of the Rand index. Comparison between Hüllermeier et al., Brouwer, Campello, Anderson
et al.

                          NDC       Brouwer   Campello  Anderson
2 Centers, $\Sigma_1$     0.9992    0.9995    0.9992    0.9989
2 Centers, $\Sigma_2$     0.9777    0.9846    0.9779    0.9709
2 Centers, $\Sigma_3$     0.9097    0.9280    0.9101    0.8857
3 Centers, $\Sigma_1$     0.9961    0.9977    0.9970    0.9955
3 Centers, $\Sigma_2$     0.9487    0.9630    0.9511    0.9417
3 Centers, $\Sigma_3$     0.8545    0.8617    0.8486    0.8393
4 Centers, $\Sigma_1$     0.9947    0.9972    0.9968    0.9944
4 Centers, $\Sigma_2$     0.8820    0.9125    0.9154    0.8888
4 Centers, $\Sigma_3$     0.7187    0.7350    0.7639    0.7598
Random 2 Centers          0.4963    0.4954    0.4979    0.4953
Random 3 Centers          0.4982    0.4720    0.5318    0.5527
Random 4 Centers          0.5231    0.4910    0.5721    0.6207

As can be noted from Table 2, the behavior of the extensions of the Rand
index in all simulated data sets is quite similar when these indexes are used to
compare a fuzzy partition with the known crisp partition. We also computed
the ACI and the extensions of the Adjusted Rand index proposed by
Brouwer (2009), Campello (2007) and Anderson et al. (2010). Unsurprisingly, as can be
noted from Table 3, a conclusion similar to the previous one can be reached.


Table 3: Fuzzy extensions of adjusted Rand index. Comparison between ACI, Brouwer, Campello, and Anderson
et al.

                          ACI       Brouwer   Campello  Anderson
2 Centers, $\Sigma_1$     0.9984    0.9989    0.9984    0.9978
2 Centers, $\Sigma_2$     0.9555    0.9693    0.9557    0.9419
2 Centers, $\Sigma_3$     0.8196    0.8561    0.8202    0.7714
3 Centers, $\Sigma_1$     0.9912    0.9948    0.9932    0.9897
3 Centers, $\Sigma_2$     0.8852    0.9181    0.8906    0.8676
3 Centers, $\Sigma_3$     0.6848    0.7104    0.6714    0.6391
4 Centers, $\Sigma_1$     0.9855    0.9923    0.9913    0.9848
4 Centers, $\Sigma_2$     0.7025    0.7844    0.7824    0.6974
4 Centers, $\Sigma_3$     0.3607    0.4282    0.4311    0.3491
Random 2 Centers         -0.0047   -0.0043   -0.0042   -0.0094
Random 3 Centers         -0.0055   -0.0062   -0.0076   -0.0167
Random 4 Centers          0.0005    0.0011    0.0008   -0.0188


5.2 Comparing two fuzzy partitions

For the second simulation study, we generated 7 data sets with $C = 2, 3, \dots, 8$
cluster centers by incrementally merging eight different bivariate normal
distributions with mean vectors

$\mu_2 = [2, 2]$;
$\mu_3 = [0, 0]$;
$\mu_4 = [-2, 2]$;
$\mu_5 = [2, -2]$;
$\mu_6 = [\dots]$;
$\mu_7 = [\dots]$;
$\mu_8 = [9, 9]$;

and covariance matrix equal to $I \times \sigma'$, where $\sigma$ is a vector made of two draws
from a uniform distribution in (0.1, 1). The structure of the data sets is
presented in Table 4. The sample size was set, this time, to 120.

For each data set we ran the fuzzy C-means algorithm by setting $C = 2, 3, \dots, 8$.
Then we computed Hüllermeier et al.'s NDC and the ACI between the returned
fuzzy partitions for each data set and for $C = 2, 3, \dots, 8$. We decided not to
show in this case the other variants of the Rand index and Adjusted Rand index
by Brouwer (2009), Campello (2007) and Anderson et al. (2010) since, even if
they can be applied to compare two fuzzy partitions, they lead to inconclusive
measures (i.e. these indexes do not satisfy the reflexivity condition). For further
details, we refer to Campello (2007), Anderson et al. (2010) and
Hüllermeier et al. (2012).

Table 4: Second simulation study: data sets structure

Data set number    Centers
1                  $\mu_1$, $\mu_2$
2                  Add $\mu_3$ to data set 1
3                  Add $\mu_4$ to data set 2
4                  Add $\mu_5$ to data set 3
5                  Add $\mu_6$ to data set 4
6                  Add $\mu_7$ to data set 5
7                  Add $\mu_8$ to data set 6

Table 5: Comparing two fuzzy partitions: normalized degree of concordance

              C=2      C=3      C=4      C=5      C=6      C=7      C=8
Data set 1    1.0000   0.9012   0.8423   0.7895   0.7607   0.7558   0.7236
Data set 2    0.8125   1.0000   0.8945   0.8853   0.8610   0.8252   0.8182
Data set 3    0.7027   0.8681   1.0000   0.9304   0.9083   0.8925   0.8687
Data set 4    0.7024   0.8346   0.9201   1.0000   0.9668   0.9291   0.9157
Data set 5    0.6850   0.8231   0.8820   0.9275   1.0000   0.9511   0.9286
Data set 6    0.6173   0.7582   0.8438   0.8944   0.9513   1.0000   0.9728
Data set 7    0.4540   0.6973   0.8084   0.8776   0.9177   0.9551   1.0000


Table 6: Comparing two fuzzy partitions: Adjusted Concordance index

              C=2      C=3      C=4      C=5      C=6      C=7      C=8
Data set 1    1.0000   0.7895   0.6606   0.5484   0.4879   0.4768   0.4093
Data set 2    0.5550   1.0000   0.7196   0.6889   0.6188   0.5092   0.4870
Data set 3    0.2743   0.6221   1.0000   0.7774   0.7000   0.6416   0.5539
Data set 4    0.3319   0.5603   0.7685   1.0000   0.8964   0.7652   0.7188
Data set 5    0.2624   0.4906   0.6230   0.7530   1.0000   0.8235   0.7318
Data set 6    0.1755   0.3461   0.5082   0.6399   0.8201   1.0000   0.8927
Data set 7    0.1002   0.2751   0.4457   0.5863   0.6985   0.8254   1.0000

Both Tables 5 and 6 show that NDC and ACI are equal to 1 when comparing
the same partitions. In both Tables we note that the value of the indexes is
larger when the number of estimated clusters is close to the number of clusters
actually generated. It can be noted that both indexes tend to return slightly larger values
when the number of estimated clusters is higher than the true number of clusters.
But the magnitude of the ACI is more realistic than that of the NDC. For
example, in the first row of Table 5, the value of the NDC is equal to 0.7236
when we compare the fuzzy partitions of a data set with two centers with the
solution with C=8 estimated centers, while in the same situation (first row of
Table 6) ACI=0.4093. The same conclusion can be reached when we compare
the last row of these Tables (NDC=0.4540, ACI=0.1002). It is worth noting
that these differences in magnitude are due to the fact that, while the NDC
still considers the partitions quite similar to each other and, for this reason,
not so far apart, the ACI, by estimating the expected value of the index to
take into account the model of randomness, indicates that part of the similarity
detected by the NDC is due to chance. It is worth highlighting that this cannot
be stated for the extensions of the Adjusted Rand index proposed by the other
authors, since they used the standard formulation of the ARI by modifying the
four cardinalities, but this does not guarantee that the expected values of their
fuzzy Rand indexes are correctly identified.


5.3 Comparing estimated and "true" fuzzy partitions 


To explain what we mean by "true" fuzzy partition we introduce the
Probabilistic-Distance (PD) clustering (Ben-Israel and Iyigun, 2008). PD clustering
allows for a probabilistic allocation of cases to classes or clusters. It is a form
of fuzzy clustering that is independent of the specification of fuzzifiers. It is
based on the principle that probability and distance are inversely related

$$p_k(x) d_k(x) = \text{constant, depending on } x, \qquad (5.1)$$

in which $d_k(x)$ is a distance measure between the $i$-th individual $x$ and the $k$-th
cluster center and $p_k(x)$ denotes the probability that the $i$-th individual belongs
to the $k$-th cluster, for $k = 1, \dots, K$.

Equation (5.1) allows us to define the membership probabilities as

$$p_k(x) = \frac{\prod_{j \neq k} d_j(x)}{\sum_{t=1}^{K} \prod_{j \neq t} d_j(x)}. \qquad (5.2)$$
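The membership probabilities implied by the inverse relation between probability and distance can be sketched as follows. The function name `pd_memberships`, the use of known cluster centers as input and the small constant `eps` that guards against zero distances are assumptions; the sketch uses the fact that, for strictly positive distances, the probabilities are proportional to $1/d_k(x)$.

```python
import numpy as np


def pd_memberships(X, centers, eps=1e-12):
    """Membership probabilities with p_k(x) * d_k(x) constant over k."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Euclidean distance of every observation to every cluster center.
    D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    D = np.maximum(D, eps)  # assumed guard: avoid division by zero
    inverse = 1.0 / D
    return inverse / inverse.sum(axis=1, keepdims=True)


X = np.array([[1.0, 0.0], [2.0, 0.0]])        # made-up observations
centers = np.array([[0.0, 0.0], [4.0, 0.0]])  # made-up cluster centers
memberships = pd_memberships(X, centers)
print(memberships)
```

Each row of the result sums to one, so the output has the form of the partition matrix in Equation (1.2) and can be passed to any of the fuzzy indexes discussed above.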
Using Equation (5.2) we are able to determine, for a real or a simulated data set,
the "true" fuzzy partitions, provided that we know a priori the composition
of the clusters, and then to compute the indexes between the estimated and
"true" fuzzy partitions. We should note that we removed all categorical
variables from these data and we always used the same setting based on the Euclidean
distance.

As a first experiment we ran the algorithms on the same data sets used before.
Even though we decided not to use in the previous experiment the fuzzy extensions
of the indexes proposed by Brouwer (2009), Campello (2007) and Anderson et al.
(2010) for comparing fuzzy partitions, in this case we also included these
indexes. By looking at both Tables 7 and 9 it seems that all these indexes have a
similar behavior. Note that, while it is expected that the higher the variance the
lower the magnitude of the indexes, it seems that there is a considerable
difference among the indexes for different levels of the covariance matrix. Moreover,
as also pointed out by Anderson et al. (2010), these indexes are not comparable.


Table 7: Comparing fuzzy partitions: comparison of the estimated membership probabilities and the true fuzzy
partition.

             NDC                                  ACI
data set     $\Sigma_1$  $\Sigma_2$  $\Sigma_3$   $\Sigma_1$  $\Sigma_2$  $\Sigma_3$
2 centers    0.9988      0.9922      0.9808       0.9975      0.9810      0.9413
3 centers    0.9963      0.9834      0.9812       0.9908      0.9524      0.9405
4 centers    0.8557      0.9639      0.8915       0.6201      0.8639      0.5038

             Brouwer Rand                         Brouwer Adjusted Rand
2 centers    0.959507    0.834307    0.760977     0.918929    0.654738    0.429033
3 centers    0.913283    0.762435    0.697006     0.814727    0.524424    0.37679
4 centers    0.712589    0.669445    0.652062     0.423158    0.337953    0.191748

             Campello Rand                        Campello Adjusted Rand
2 centers    0.950726    0.804091    0.686992     0.901445    0.608165    0.373974
3 centers    0.91354     0.757514    0.665499     0.811207    0.499784    0.325355
4 centers    0.73316     0.701056    0.584516     0.424566    0.370523    0.156312

             Anderson Rand                        Anderson Adjusted Rand
2 centers    0.922612    0.718153    0.592507     0.845209    0.436248    0.184937
3 centers    0.854373    0.69217     0.63614      0.668937    0.307852    0.203322
4 centers    0.732508    0.667434    0.632849     0.289444    0.095834    0.002191

Table 8: Comparing fuzzy partitions: both partitions are the true fuzzy partitions.

             NDC                                  ACI
data set     $\Sigma_1$  $\Sigma_2$  $\Sigma_3$   $\Sigma_1$  $\Sigma_2$  $\Sigma_3$
2 centers    1.0000      1.0000      1.0000       1.0000      1.0000      1.0000
3 centers    1.0000      1.0000      1.0000       1.0000      1.0000      1.0000
4 centers    1.0000      1.0000      1.0000       1.0000      1.0000      1.0000

             Brouwer Rand                         Brouwer Adjusted Rand
2 centers    0.9595      0.8344      0.7610       0.9189      0.6550      0.4287
3 centers    0.9132      0.7619      0.6971       0.8146      0.5233      0.3768
4 centers    0.8947      0.6681      0.6623       0.7493      0.3349      0.2127

             Campello Rand                        Campello Adjusted Rand
2 centers    0.9508      0.8055      0.6913       0.9016      0.6109      0.3826
3 centers    0.9138      0.7597      0.6692       0.8119      0.5044      0.3329
4 centers    0.9112      0.7056      0.6124       0.7830      0.3801      0.2112

             Anderson Rand                        Anderson Adjusted Rand
2 centers    0.9225      0.7182      0.5919       0.8451      0.4363      0.1837
3 centers    0.8542      0.6912      0.6357       0.6685      0.3057      0.2024
4 centers    0.8419      0.6658      0.6331       0.5696      0.0913      0.0018


Both Tables 8 and 10 show the indexes computed by comparing the true fuzzy
partition with itself. In this case, the reflexivity property of the NDC of
Hüllermeier et al. is emphasized. We think it is now clear that the adjusted
versions of the fuzzy Rand-like indexes do not make sense if they are computed
using Formula 2.3. In some cases the adjusted Rand-like index comparing a
partition with itself is negative and should be set equal to zero (as in the
case of the index of Anderson et al. in Table 10).
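
The reflexivity property can be checked numerically. The sketch below is a
minimal illustration, assuming the NDC of Hüllermeier et al. (2012) with the
fuzzy equivalence relation E(x, x') = 1 - ||u(x) - u(x')||_1 / 2 computed on the
rows of a membership matrix whose rows sum to one. The function names and the
small matrices P and Q are hypothetical and serve only as an example.

```python
import numpy as np

def pairwise_similarity(U):
    """E(x, x') = 1 - ||U[x] - U[x']||_1 / 2 for every pair x < x'."""
    U = np.asarray(U, dtype=float)
    # Halved L1 distance between membership vectors lies in [0, 1]
    # when each row of U sums to one (assumption).
    dist = np.abs(U[:, None, :] - U[None, :, :]).sum(axis=2) / 2.0
    return 1.0 - dist[np.triu_indices(U.shape[0], k=1)]

def ndc(U, V):
    """Normalized degree of concordance between two membership matrices."""
    return 1.0 - np.mean(np.abs(pairwise_similarity(U) - pairwise_similarity(V)))

# Hypothetical fuzzy partitions of 4 objects into 2 clusters
P = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.5, 0.5]])
Q = np.array([[0.6, 0.4], [0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])

print("NDC(P, P) =", ndc(P, P))  # a partition compared with itself
print("NDC(P, Q) =", ndc(P, Q))  # two different partitions
```

Because the pairwise similarities of a partition coincide with themselves, the
self-comparison attains the maximum of the index, which is the behaviour the
Rand-like extensions fail to show in Tables 8 and 10.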


Table 9: Comparing fuzzy partitions: comparison of the estimated membership probabilities and the true fuzzy 
partition. 


data set   NDC        ACI        Brouwer    Brouwer    Campello   Campello   Anderson   Anderson
                                 Rand       Adj. Rand  Rand       Adj. Rand  Rand       Adj. Rand
Random1    0.8125    -0.0055     0.8789    -0.0017     0.4999    -0.0002     0.4949    -0.0102
Random2    0.7836     0.0478     0.7867     0.0130     0.5106     0.0206     0.5514    -0.0195
Random3    0.8250     0.1242     0.7685     0.0450     0.5152     0.0283     0.6218    -0.0290


Table 10: Comparing fuzzy partitions: both partitions are the true fuzzy partitions.

data set   NDC        ACI        Brouwer    Brouwer    Campello   Campello   Anderson   Anderson
                                 Rand       Adj. Rand  Rand       Adj. Rand  Rand       Adj. Rand
Random1    1.0000     1.0000     0.9886     0.0141     0.5086     0.0173     0.4950    -0.0102
Random2    1.0000     1.0000     0.9493     0.0599     0.5147     0.0293     0.5512    -0.0202
Random3    1.0000     1.0000     0.8661     0.1352     0.5271     0.0532     0.6216    -0.0294


As a second experiment we used real data sets taken from the UCI repository
for machine learning. Results are summarized in Table 11.

We would like to point out that it is not our intention to use the extensions
of the Rand and of the Adjusted Rand index to identify the number of clusters
to use, since these indexes are external validity measures. As can be noted from
Table 11 and as already pointed out, all the indexes, apart from NDC and ACI,
do not respect the reflexivity property. Furthermore, in some cases, the value
of the index obtained when a partition is compared with itself (first and third
columns) is lower than the value obtained when comparing the estimated partition
with the true partition (e.g. in the Vehicle data set, the Anderson et al.
approach for both the Rand and the adjusted Rand extensions). Nevertheless, even
if it is not correct to compare these indexes (as also stated in Anderson et al.
(2010)), it seems that the behavior of the corrections applied to the adjusted
extensions with respect to their Rand extensions is quite similar. Of course,
the interpretation of all these corrections as a correction for randomness is
possible, as already stated, only when we consider our approach with respect to
NDC. For instance, in the case of the Sonar data set, NDC is equal to 0.9647
when comparing the true and the estimated partition and ACI is equal to -0.0001.
These results could seem inconsistent, but if we take into account the true
partition, we can notice that each object has a probability of belonging to each
cluster that is really close to 0.5, and the membership probabilities estimated
by the PD-clustering algorithm are also close to 0.5. In light of this, it is
clear that an NDC close to 1 means that these two partitions are really close to
each other. On the other hand, probabilities close to 0.5 are equivalent to a
coin-flipping experiment and, for this reason, it is not surprising that ACI is
close to zero. On the contrary, for the Iris data set, NDC is 0.9780 and ACI is
0.9298. These results indicate that the estimated partition is really close to
the true one but, in this case, this similarity is not due to chance. It is
worth stressing again that our goal is not to evaluate clustering algorithms,
but only the behaviour of these external validation indexes.
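
The contrast between the Sonar and the Iris results can be mimicked on
synthetic memberships. The sketch below is an illustration under stated
assumptions, not the procedure used to produce the tables: it assumes the NDC
definition of Hüllermeier et al. (2012), estimates the expected NDC by randomly
permuting the pairwise similarities of one partition, and uses hypothetical
names (aci, n_perm, two_cluster) and made-up membership values.

```python
import numpy as np

def pairwise_similarity(U):
    """E(x, x') = 1 - ||U[x] - U[x']||_1 / 2 for all pairs x < x' (rows sum to 1)."""
    U = np.asarray(U, dtype=float)
    dist = np.abs(U[:, None, :] - U[None, :, :]).sum(axis=2) / 2.0
    return 1.0 - dist[np.triu_indices(U.shape[0], k=1)]

def ndc_from_similarities(e_p, e_q):
    """NDC = 1 - mean absolute difference between two similarity vectors."""
    return 1.0 - np.mean(np.abs(e_p - e_q))

def aci(U, V, n_perm=200, seed=0):
    """Return (NDC, chance-corrected NDC).

    Assumption: the chance baseline is the Monte Carlo mean of the NDC over
    random shuffles of the pairwise similarities of V. The ratio is undefined
    if that baseline equals 1 exactly.
    """
    rng = np.random.default_rng(seed)
    e_p, e_q = pairwise_similarity(U), pairwise_similarity(V)
    observed = ndc_from_similarities(e_p, e_q)
    expected = np.mean([ndc_from_similarities(e_p, rng.permutation(e_q))
                        for _ in range(n_perm)])
    return observed, (observed - expected) / (1.0 - expected)

def two_cluster(a):
    """Two-cluster membership matrix built from first-cluster memberships a."""
    return np.column_stack([a, 1.0 - a])

rng = np.random.default_rng(1)
n = 100

# Coin-flip case: every membership is close to 0.5 in both partitions
flip_true = two_cluster(0.5 + rng.uniform(-0.03, 0.03, n))
flip_est = two_cluster(0.5 + rng.uniform(-0.03, 0.03, n))
ndc_flip, aci_flip = aci(flip_true, flip_est)

# Separated case: two clear groups, recovered up to small noise
centre = np.repeat([0.95, 0.05], n // 2)
sep_true = two_cluster(centre + rng.uniform(-0.03, 0.03, n))
sep_est = two_cluster(centre + rng.uniform(-0.03, 0.03, n))
ndc_sep, aci_sep = aci(sep_true, sep_est)

print(f"near-0.5 memberships:  NDC={ndc_flip:.4f}  ACI={aci_flip:.4f}")
print(f"separated memberships: NDC={ndc_sep:.4f}  ACI={aci_sep:.4f}")
```

Under these assumptions both cases should give an NDC close to 1, while the
chance-corrected value should stay near zero in the first case and remain high
in the second, in line with the reading of the Sonar and Iris results given
above.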

Table 11: Comparison between estimated and true probabilistic partitions. Extended Rand indexes, on the left
side, and extended adjusted Rand indexes on the right side. For each side of the table, the first column represents
the index computed between the true probabilistic partition and itself, while the second column represents the
index computed between the true probabilistic partition and the one returned by the PD-clustering algorithm by
setting the number of clusters equal to the known number of clusters, indicated in brackets.

Data sets         Rand        True fuzzy  C clusters  Adjusted    True fuzzy  C clusters
                  extensions  partition   solution    extensions  partition   solution

Vehicle (C=4)     Anderson     0.6227      0.6261     Anderson     0.0223      0.0272
                  Campello     0.5699      0.5325     Campello     0.1397      0.0656
                  Brouwer      0.7670      0.6526     Brouwer      0.2159      0.1639
                  NDC          1.0000      0.8085     ACI          1.0000      0.3008

Sonar (C=2)       Anderson     0.4976      0.4976     Anderson    -0.0048     -0.0049
                  Campello     0.5072      0.5000     Campello     0.0144      0.0000
                  Brouwer      0.9924      0.9962     Brouwer      0.0070      0.0000
                  NDC          1.0000      0.9647     ACI          1.0000     -0.0001

Pima (C=2)        Anderson     0.5003      0.5025     Anderson     0.0007      0.0050
                  Campello     0.5333      0.5265     Campello     0.0665      0.0528
                  Brouwer      0.9365      0.8039     Brouwer      0.0648      0.0407
                  NDC          1.0000      0.8125     ACI          1.0000      0.1431

Iris (C=3)        Anderson     0.6167      0.6160     Anderson     0.1321      0.1305
                  Campello     0.6552      0.6477     Campello     0.3005      0.2860
                  Brouwer      0.7211      0.7229     Brouwer      0.4096      0.4107
                  NDC          1.0000      0.9780     ACI          1.0000      0.9298

Ionosphere (C=2)  Anderson     0.4989      0.4997     Anderson    -0.0022     -0.0005
                  Campello     0.5185      0.5168     Campello     0.0371      0.0336
                  Brouwer      0.9596      0.9028     Brouwer      0.0390      0.0440
                  NDC          1.0000      0.8979     ACI          1.0000      0.2488

Flags (C=8)       Anderson     0.7333      0.7317     Anderson     0.0103      0.0108
                  Campello     0.5823      0.5573     Campello     0.1575      0.0938
                  Brouwer      0.7300      0.5301     Brouwer      0.3437      0.1202
                  NDC          1.0000      0.6828     ACI          1.0000      0.1448


6 Concluding remarks


In this paper we proposed the adjusted version of the normalized degree of
concordance index (NDC) defined by Hüllermeier et al. (2012) for comparing
fuzzy partitions, namely the Adjusted Concordance Index (ACI). This measure
is constructed following the same reasoning as the well-known Adjusted Rand
index by Hubert and Arabie (1985) for comparing hard partitions.

We derived the proposed index by normalizing the difference between NDC
and its expected value, obtained by considering a large number of permutations
of the similarities in the similarity matrices. Experimental evaluations
show that ACI returns more coherent results than NDC in comparing
fuzzy partitions when the aim is to validate clustering solutions, because it
takes into account the possible randomness component of the similarity
measure. The same approach cannot be applied to the other indexes because they
are not reflexive and their maximum value is not known. Moreover, when
comparing a fuzzy and a crisp partition, ACI is closely related to the adjusted
Rand index for fuzzy partitions defined by Campello (2007); Brouwer (2009);
Anderson et al. (2010). It should be noted that ACI can be used for comparing
a fuzzy partition with a crisp one as well as two fuzzy partitions.
Furthermore, the approach for dealing with fuzzy partitions introduced by
Hüllermeier et al. (2012) ensures that ACI shows properties of a proper metric
under certain assumptions (i.e. Ruspini's partitions, probabilistic partitions).
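
Written out, and assuming that the expected value is approximated by the
average over B random permutations of the similarities, the normalization takes
the usual chance-corrected form. The symbols B, \pi_b and \overline{NDC} are
notation introduced here for illustration and are not necessarily those used in
the definition of the index.

```latex
\[
\mathrm{ACI}(P,Q) \;=\;
\frac{\mathrm{NDC}(P,Q) - \overline{\mathrm{NDC}}(P,Q)}
     {1 - \overline{\mathrm{NDC}}(P,Q)},
\qquad
\overline{\mathrm{NDC}}(P,Q) \;=\;
\frac{1}{B}\sum_{b=1}^{B} \mathrm{NDC}\bigl(P, \pi_b(Q)\bigr).
\]
```

The constant 1 in the denominator is the known maximum of NDC, attained when a
partition is compared with itself; this is the ingredient that is missing for
the non-reflexive Rand-like extensions.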

In addition, using the PD-clustering approach by Ben-Israel and Iyigun (2008),
we were able to build the reference true probabilistic partitions for some real
data sets to show the potential of ACI as an external validity measure.